feat: add compression module with zstd/gzip/lzma support #165
cluster2600 wants to merge 1 commit into alibaba:main
Conversation
- Add compression module supporting zstd, gzip, and lzma codecs
- Add compression parameter to CollectionSchema for storage optimization
- Add compression integration module for end-to-end vector compression
- Add streaming compression API for large datasets
- Enable RocksDB compression with runtime codec detection (ZSTD → LZ4 → Snappy → none)
- Add comprehensive compression documentation and tests
The RocksDB compression uses GetSupportedCompressions() to detect available codecs at runtime, preventing crashes when ZSTD is not linked with the binary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
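The fallback chain described in the commit message can be sketched roughly as follows. This is an illustration in Python, not the PR's C++ code: the PR queries RocksDB through GetSupportedCompressions(), whereas this sketch assumes a codec counts as "supported" when a Python module can be imported. The names `PREFERENCE`, `CANDIDATE_MODULES`, `detect_supported`, and `pick_codec` are hypothetical, as is the mapping of codec names to third-party modules.

```python
import importlib

# Preference order taken from the commit message: ZSTD, then LZ4, then Snappy.
PREFERENCE = ("zstd", "lz4", "snappy")

# Hypothetical mapping from codec name to an importable Python module.
# The PR itself asks RocksDB which codecs were linked into the binary.
CANDIDATE_MODULES = {"zstd": "zstandard", "lz4": "lz4", "snappy": "snappy"}


def detect_supported(candidates=CANDIDATE_MODULES):
    """Return the set of codec names whose backing module can be imported."""
    supported = set()
    for name, module in candidates.items():
        try:
            importlib.import_module(module)
        except ImportError:
            continue  # codec not available in this environment
        supported.add(name)
    return supported


def pick_codec(supported, preference=PREFERENCE):
    """Return the first preferred codec that is supported, else "none"."""
    for name in preference:
        if name in supported:
            return name
    return "none"


if __name__ == "__main__":
    print("selected codec:", pick_codec(detect_supported()))
```

Separating detection from selection keeps the selection logic trivially testable: `pick_codec` never touches the environment, so a missing codec simply moves the choice down the list instead of failing.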
Thanks for putting together this compression proposal. We appreciate the effort and thought you've put into the design!
Our system is built around a C++ core, with Python (and other languages) serving as bindings to this vector search engine. Because of this, we generally aim to keep data-handling logic in the C++ engine, so all bindings benefit uniformly and behavior remains consistent. Are there particular use cases or constraints that make compression in Python necessary or advantageous?
I'm also curious about the expected benefit for vector data specifically. High-dimensional floating-point vectors are typically not very compressible with general-purpose algorithms. For storage efficiency, we already support quantization (though it hurts recall), which is typically more effective for vectors than byte-level compression.
Any additional context you can share about the problem you're trying to solve would be very helpful!
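The reviewer's point about compressibility can be checked with a rough experiment. The sketch below uses only the Python standard library and synthetic Gaussian vectors, so it is an assumption-laden illustration rather than a benchmark of zvec: real embeddings may behave differently, and the int8 scheme shown is a naive symmetric scalar quantizer, not zvec's quantization.

```python
import random
import struct
import zlib

random.seed(0)
DIM, COUNT = 128, 200  # arbitrary sizes chosen for the illustration

# Synthetic float32 vectors; real embeddings may compress differently.
values = [random.gauss(0.0, 1.0) for _ in range(DIM * COUNT)]
raw = struct.pack(f"<{len(values)}f", *values)

# General-purpose byte-level compression (zlib implements DEFLATE,
# the algorithm gzip uses).
compressed = zlib.compress(raw, 9)

# Naive symmetric int8 scalar quantization: lossy, one byte per component.
scale = max(abs(v) for v in values) / 127.0
quantized = struct.pack(
    f"<{len(values)}b", *(round(v / scale) for v in values)
)

zlib_ratio = len(compressed) / len(raw)
int8_ratio = len(quantized) / len(raw)
print(f"zlib: {zlib_ratio:.2f} of raw size, int8: {int8_ratio:.2f} of raw size")
```

The mantissa bits of a float32 component are close to random, so a lossless codec has little redundancy to exploit, while quantization shrinks the data by discarding precision. That trade-off (size versus recall) is the one the comment above refers to.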
Thanks for the thoughtful feedback @zhourrr! You raise valid points: keeping data-handling logic in the C++ core makes sense for consistency across all language bindings, and I agree that general-purpose compression has limited benefit for high-dimensional float vectors compared to the quantization you already support.
I'll close this one out. If compression at the C++ engine level ever becomes useful, happy to contribute there instead. Appreciate the review!
Summary
- CollectionSchema with compression configuration
Changes
Python
- python/zvec/compression.py: Core compression with pluggable backends
- python/zvec/compression_integration.py: Collection-level compression integration
- python/zvec/streaming.py: Streaming compression/decompression API
- python/zvec/model/schema/collection_schema.py: Compression config in schema
- python/zvec/__init__.py: Export compression module
C++
- src/db/common/rocbsdb_context.cc: Runtime detection of supported compression codecs using rocksdb::GetSupportedCompressions(), with fallback chain: ZSTD → LZ4 → Snappy → None
Tests
- python/tests/test_compression.py
- python/tests/test_compression_integration.py
- python/tests/test_schema_compression.py
- python/tests/test_streaming.py
Docs
- docs/COMPRESSION.md: Compression usage and configuration guide
Context
Split from #157 to isolate compression feature from CI and GPU work.
Test plan
rocbsdb_context.cc